🫀 Listen to Your Heart: A Heart Disease Prediction 💔


Context

Cardiovascular diseases (CVDs) are the number one cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains clinical features that can be used to predict possible heart disease.

People with cardiovascular disease, or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease), need early detection and management, wherein a machine learning model can be of great help.

📌 Notebook Objectives

This notebook aims to:

  • 📊 Perform dataset exploration using various data analysis and data visualization techniques, and save the results as Excel files.
  • 🔍 Perform feature engineering to improve the data and select the best features.
  • 📂 Split the dataset and save separate files for training and testing.
  • 🛠️ Build machine learning models that can predict a patient's disease status.
  • 💾 Export prediction results on the test data in CSV format.
  • 📝 Perform predictions on new example data and export the prediction results.
  • 💾 Save the BEST model and use it later for deployment.
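The last objective, saving the best model for later deployment, is typically done with joblib; a minimal sketch under that assumption (the filename best_model.joblib and the model choice are illustrative only, not from this notebook):

```python
# Hedged sketch: persisting a trained model with joblib and reloading it.
# The filename and model choice here are illustrative only.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

joblib.dump(model, "best_model.joblib")    # save the fitted model to disk
loaded = joblib.load("best_model.joblib")  # reload it later for deployment
```

A deployment script can then call loaded.predict(...) on new patient records without retraining.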

Importing Libraries 📚


In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In [2]:
# Loading, Preprocessing, Analysis Libraries
import numpy as np
import pandas as pd

# Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
%matplotlib inline


# Model Training And Testing libraries
from sklearn.model_selection import train_test_split

# Model Algorithms Libraries
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Pre-Processing Libraries
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Metrics & Hyperparameter Libraries
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, confusion_matrix, recall_score, accuracy_score, precision_score, f1_score, classification_report, roc_curve
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV, RepeatedStratifiedKFold, KFold, StratifiedKFold

# Best Features Selection For Each Category Libraries
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import chi2

# Profiling Libraries
from ydata_profiling import ProfileReport
import os


 

1 | Reading & Loading The Dataset 👓

This dataset was created manually for predicting whether a patient has heart disease. It contains cardiac information on patients together with the diagnosis of whether each patient has heart disease.

Machine learning models can help determine whether a patient has heart disease and speed up the diagnostic process based on the medical information provided about that patient. The variables that most influence whether a patient has heart disease will also be explored more deeply in this notebook.

In [3]:
# loading the csv data to a Pandas DataFrame
heart_data = pd.read_csv(r"C:\Users\acer\Downloads\IDS Project\Dataset\Heart.csv")

3 | Initial Dataset Exploration 🔍


3.1 | Profile Report of Dataset


  1. Running the code below will give you an overview of the dataset, with each feature, its count, and its datatype, along with correlations between the different features.
  2. It will also display important details about the dataset, such as counts of variables, null values, duplicates, and more.
  3. ProfileReport is used to generate the report of our dataset.
In [4]:
profile = ProfileReport(heart_data, title="Heart Disease Report", explorative=True)
profile
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Out[4]:

In [14]:
# Save the report to the specified path
profile.to_file(r"C:\Users\acer\Downloads\IDS Project\Dataset_Report.html")
Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]
Attribute Description Emoji
Age Age of the patient [Years] 👵👴
Sex Sex of the patient [M: Male, F: Female] 🚹🚺
ChestPainType Chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic] ❤️‍🩹
RestingBP Resting blood pressure [mm Hg] 💉
Cholesterol Serum cholesterol [mg/dl] 🩸
FastingBS Fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise] 🧁
RestingECG Resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality, LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria] 🩺
MaxHR Maximum heart rate achieved [Numeric value between 60 and 202] 💓
Exercise Angina Exercise-induced angina [Y: Yes, N: No] 🏃‍♂️🚫
Oldpeak oldpeak = ST [Numeric value measured in depression] 📉
ST_Slope The slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping] 📈
VCF Number of major blood vessels [Values: 0-3] 🔢
Smoking Smoking status [1: Yes (is a smoker), 0: No (is not a smoker)] 🚬🚭
Creatine Level of the CPK enzyme in the blood [mcg/L (0-2500)] 🧪
Thal Thalassemia status [Values ranging from 0-3] 🧬
HeartDisease Output class [1: Has Disease, 0: No Disease] ❤️‍🩹❤️

4 | Exploratory Data Analysis📈

First Question: Why Do We Need This?

The outcomes of this phase are:

  • Understanding the given dataset, which helps in cleaning it up.
  • Getting a clear picture of the features and the relationships between them.
  • Providing guidelines for essential variables and removing non-essential ones.
  • Handling missing values and human error.
  • Identifying outliers.
  • Finding duplicate values.
  • Maximizing insight into the dataset.

This process is time-consuming but very effective.

4.1 | Print first 5 rows of the dataset

In [8]:
# print first 5 rows of the dataset
heart_data.head()
Out[8]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR Exercise Agina Oldpeak ST_Slope VCF Smoking Creatine Thal HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 2 0 168 3 Y
1 49 F NAP 160 180 1 Normal 156 N 1.0 Flat 0 0 155 3 Y
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0 1 125 3 Y
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1 0 161 3 Y
4 54 M NAP 150 195 1 Normal 122 N 0.0 Up 3 0 106 2 Y

4.2 | Print Last 5 rows of the dataset

In [6]:
# print last 5 rows of the dataset
heart_data.tail()
Out[6]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR Exercise Agina Oldpeak ST_Slope VCF Smoking Creatine Thal HeartDisease
1395 53 F ASY 130 130 1 ST 120 Y 2.0 Flat 3 1 120 1 Y
1396 38 M ASY 138 138 0 LVH 139 Y 2.5 Up 3 1 139 1 Y
1397 53 F ATA 117 117 0 Normal 108 Y 2.0 Flat 0 1 108 1 Y
1398 62 M ATA 121 121 0 Normal 148 Y 2.5 Up 2 1 148 1 Y
1399 50 M TA 193 179 1 LVH 92 N 0.4 Flat 0 1 92 1 N

4.3 | Shape of Dataset

In [3]:
print("The shape of the dataset is : ")
heart_data.shape
The shape of the dataset is : 
Out[3]:
(1400, 16)

4.4 | Information of The Dataset

In [4]:
# getting some info about the data
heart_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             1400 non-null   int64  
 1   Sex             1400 non-null   object 
 2   ChestPainType   1400 non-null   object 
 3   RestingBP       1400 non-null   int64  
 4   Cholesterol     1400 non-null   int64  
 5   FastingBS       1400 non-null   int64  
 6   RestingECG      1400 non-null   object 
 7   MaxHR           1400 non-null   int64  
 8   Exercise Agina  1400 non-null   object 
 9   Oldpeak         1400 non-null   float64
 10  ST_Slope        1400 non-null   object 
 11  VCF             1400 non-null   int64  
 12  Smoking         1400 non-null   int64  
 13  Creatine        1400 non-null   int64  
 14  Thal            1400 non-null   int64  
 15  HeartDisease    1400 non-null   object 
dtypes: float64(1), int64(9), object(6)
memory usage: 175.1+ KB

4.5 | Displaying the Count and Percentage of Each Value in Each Feature

In [7]:
def percent_counts(df, feature):
    total = df[feature].value_counts(dropna=False)
    percent = round(df[feature].value_counts(dropna=False, normalize=True) * 100, 2)
    percent_count = pd.concat([total, percent], keys=['Total', 'Percentage'], axis=1)
    return percent_count
In [21]:
# Path to save the file
output_path = r"C:\Users\acer\Downloads\IDS Project\Features_counts.xlsx"

# Create a Pandas Excel writer using XlsxWriter as the engine.
with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
    for feature in heart_data.columns:
        df_counts = percent_counts(heart_data, feature)
        
        # Convert the DataFrame to an XlsxWriter Excel object.
        df_counts.to_excel(writer, sheet_name=feature)

print(f"Report saved to {output_path}")
Report saved to C:\Users\acer\Downloads\IDS Project\Features_counts.xlsx

4.6 | Summary Statistics of Numerical Variables

In [23]:
def save_descriptive_statistics_excel(df, numeric_columns, output_path):
    # Create a Pandas Excel writer using XlsxWriter as the engine.
    with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
        for col in numeric_columns:
            stats = df[[col]].describe()
            stats.to_excel(writer, sheet_name=col)
        print(f"Descriptive statistics saved to {output_path}")

# Select the numerical columns (returns a DataFrame)
num = heart_data.select_dtypes(include=['float64', 'int64'])
num
Out[23]:
Age RestingBP Cholesterol FastingBS MaxHR Oldpeak VCF Smoking Creatine Thal
0 40 140 289.0 0 172 0.0 2.0 0 168 3
1 49 160 180.0 0 156 1.0 0.0 0 155 3
2 37 130 283.0 0 98 0.0 0.0 1 125 3
3 48 138 214.0 0 108 1.5 1.0 0 161 3
4 54 150 195.0 0 122 0.0 2.5 0 106 2
... ... ... ... ... ... ... ... ... ... ...
1395 53 130 130.0 0 120 2.0 2.5 1 120 1
1396 38 138 138.0 0 139 2.5 2.5 1 139 1
1397 53 117 117.0 0 108 2.0 0.0 1 108 1
1398 62 121 121.0 0 148 2.5 2.0 1 148 1
1399 50 170 179.0 0 92 0.4 0.0 1 92 1

1400 rows × 10 columns

In [23]:
# List of numerical columns
num = heart_data.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Path to save the file
output_path = r"C:\Users\acer\Downloads\IDS Project\Numerical_stats.xlsx"

# Call the function to save descriptive statistics
save_descriptive_statistics_excel(heart_data, num, output_path)
Descriptive statistics saved to C:\Users\acer\Downloads\IDS Project\Numerical_stats.xlsx

4.6 | Summary Statistics of Categorical Variables

In [18]:
heart_data.select_dtypes(include=['object']).describe()
Out[18]:
Sex ChestPainType RestingECG Exercise Agina ST_Slope HeartDisease
count 1400 1400 1400 1400 1400 1400
unique 2 4 3 2 3 2
top M ASY Normal N Flat Y
freq 856 760 780 815 709 700

4.7 | Balancing The Target Variable

In [14]:
percent_counts(heart_data, "HeartDisease")
Out[14]:
Total Percentage
Y 709.0 50.64
N 684.0 48.86
NaN 7.0 0.50
In [16]:
def get_null_indices(df, feature):
    null_indices = df[df[feature].isnull()].index
    return null_indices

null_indices = get_null_indices(heart_data, "HeartDisease")
null_indices
Out[16]:
Int64Index([1385, 1386, 1387, 1388, 1389, 1390, 1391], dtype='int64')
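Between the two balance checks, the seven rows with a missing HeartDisease value are presumably dropped; a minimal sketch of that step on a toy frame (in the notebook this would be heart_data.drop(index=null_indices)):

```python
import numpy as np
import pandas as pd

# Toy stand-in for heart_data, with one missing target value
toy = pd.DataFrame({"HeartDisease": ["Y", "N", np.nan, "Y"]})
null_indices = toy[toy["HeartDisease"].isnull()].index
toy = toy.drop(index=null_indices)  # remove rows with a missing target
```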
In [17]:
percent_counts(heart_data, "HeartDisease")
Out[17]:
Total Percentage
Y 700.0 50.0
N 700.0 50.0
NaN 0.0 0.0

4.8 | Selecting Continuous and Categorical Features

In [5]:
continuous_values = []
categorical_values = []

for column in heart_data.columns:
    if heart_data[column].dtype == 'int64' or heart_data[column].dtype == 'float64':
        continuous_values.append(column)
    else:
        categorical_values.append(column)
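The loop above can also be written with select_dtypes, which the notebook already uses elsewhere; a small sketch on a toy frame:

```python
import pandas as pd

# "number" matches int64 and float64 columns in one call
toy = pd.DataFrame({"Age": [40, 49], "Sex": ["M", "F"], "Oldpeak": [0.0, 1.0]})
continuous_values = toy.select_dtypes(include="number").columns.tolist()
categorical_values = toy.select_dtypes(exclude="number").columns.tolist()
```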

4.8.1 | Categorical Variables

In [24]:
categorical_values
Out[24]:
['Sex',
 'ChestPainType',
 'RestingECG',
 'Exercise Agina',
 'ST_Slope',
 'HeartDisease']

4.8.2 | Numerical Variables

In [16]:
continuous_values
Out[16]:
['Age',
 'RestingBP',
 'Cholesterol',
 'FastingBS',
 'MaxHR',
 'Oldpeak',
 'VCF',
 'Smoking',
 'Creatine',
 'Thal']

4.9 | Handling Null Values

As seen in the profile report above, the dataset does not appear to contain missing values, but we will still check to confirm.

In [8]:
# checking for missing values
print("Missing values: \n")
heart_data.isnull().sum()

### Good, we did not find any null value in the dataset
Missing values: 

Out[8]:
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
Exercise Agina    0
Oldpeak           0
ST_Slope          0
VCF               0
Smoking           0
Creatine          0
Thal              0
HeartDisease      0
dtype: int64

4.10 | Handling Duplicate Values

As seen in the profile report above, the dataset does not appear to contain duplicate values, but we will still check to confirm.

In [12]:
data_dup = heart_data.duplicated().any()
print(data_dup)
heart_data = heart_data.drop_duplicates()
print("After removing duplicates: \n")
heart_data.shape

#### As the answer is False, it means the dataset does not contain DUPLICATE VALUES
False
After removing duplicates: 

Out[12]:
(1400, 16)

4.11 | Finding Unique Values In Each Column

In [11]:
unique_counts = {}
for col in list(heart_data.columns):
    unique_counts[col] = heart_data[col].value_counts().shape[0]

pd.DataFrame(unique_counts, index=["unique count"]).transpose()
Out[11]:
unique count
Age 50
Sex 2
ChestPainType 4
RestingBP 69
Cholesterol 266
FastingBS 2
RestingECG 3
MaxHR 120
Exercise Agina 2
Oldpeak 53
ST_Slope 3
VCF 5
Smoking 2
Creatine 269
Thal 4
HeartDisease 2

4.12 | Outlier Detection

In [16]:
def outlier_detect(df, col):
    q1_col = Q1[col]
    iqr_col = IQR[col]
    q3_col = Q3[col]
    return df[((df[col] < (q1_col - 1.5 * iqr_col)) |(df[col] > (q3_col + 1.5 * iqr_col)))]

# ---------------------------------------------------------
def outlier_detect_normal(df, col):
    m = df[col].mean()
    s = df[col].std()
    return df[((df[col]-m)/s).abs()>3]

# ---------------------------------------------------------
def lower_outlier(df, col):
    q1_col = Q1[col]
    iqr_col = IQR[col]
    q3_col = Q3[col]
    lower = df[(df[col] < (q1_col - 1.5 * iqr_col))]
    return lower

# ---------------------------------------------------------
def upper_outlier(df, col):
    q1_col = Q1[col]
    iqr_col = IQR[col]
    q3_col = Q3[col]
    upper = df[(df[col] > (q3_col + 1.5 * iqr_col))]
    return upper

# ---------------------------------------------------------
def replace_upper(df, col):
    q1_col = Q1[col]
    iqr_col = IQR[col]
    q3_col = Q3[col]
    tmp = 9999999
    upper = q3_col + 1.5 * iqr_col
    df[col] = df[col].where(lambda x: (x < (upper)), tmp)
    df[col] = df[col].replace(tmp, upper)
    print('outlier replace with upper bound - {}' .format(col)) 
    
# ---------------------------------------------------------
def replace_lower(df, col):
    q1_col = Q1[col]
    iqr_col = IQR[col]
    q3_col = Q3[col]
    tmp = 1111111
    lower = q1_col - 1.5 * iqr_col
    df[col] = df[col].where(lambda x: (x > (lower)), tmp)
    df[col] = df[col].replace(tmp, lower)
    print('outlier replace with lower bound - {}' .format(col)) 
# ---------------------------------------------------------

# Writing Formulas For Upper & Lower Quartiles
Q1 = heart_data.quantile(0.25, numeric_only=True)
Q3 = heart_data.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1


# Finding variables with outlier values (comparing IQR and Z-score counts)
for i in range(len(continuous_values)):
    print("IQR => {}: {}".format(continuous_values[i], outlier_detect(heart_data, continuous_values[i]).shape[0]))
    print("Z_Score => {}: {}".format(continuous_values[i], outlier_detect_normal(heart_data, continuous_values[i]).shape[0]))
    print("********************************")
    
IQR => Age: 0
Z_Score => Age: 0
********************************
IQR => RestingBP: 42
Z_Score => RestingBP: 12
********************************
IQR => Cholesterol: 12
Z_Score => Cholesterol: 8
********************************
IQR => FastingBS: 217
Z_Score => FastingBS: 0
********************************
IQR => MaxHR: 4
Z_Score => MaxHR: 0
********************************
IQR => Oldpeak: 19
Z_Score => Oldpeak: 18
********************************
IQR => VCF: 126
Z_Score => VCF: 27
********************************
IQR => Smoking: 0
Z_Score => Smoking: 0
********************************
IQR => Creatine: 159
Z_Score => Creatine: 16
********************************
IQR => Thal: 0
Z_Score => Thal: 0
********************************
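The sentinel-based replace_upper/replace_lower functions above can be expressed more directly with pandas' clip, which bounds values to the IQR fences in one call; a sketch on a toy series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Clip everything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to the nearest fence
clipped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```

Note that binary features such as FastingBS or Smoking should generally be excluded from IQR-based clipping: when Q1 = Q3 = 0, every 1 is flagged as an outlier and would be overwritten.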

4.12.3 | Displaying Only Features With Outliers

In [18]:
outlier = []
for i in range(len(continuous_values)):
    if outlier_detect(heart_data[continuous_values],continuous_values[i]).shape[0] !=0:
        outlier.append(continuous_values[i])

print("Numerical Variables With Outlier Values : ")
outlier
Numerical Variables With Outlier Values : 
Out[18]:
['RestingBP',
 'Cholesterol',
 'FastingBS',
 'MaxHR',
 'Oldpeak',
 'VCF',
 'Creatine']

4.12.4 | Replacing Outlier Values

In [19]:
for i in range(len(outlier)):
    replace_upper(heart_data, outlier[i]) 
    
print("\n********************************\n")
for i in range(len(outlier)):
    replace_lower(heart_data, outlier[i])
    
### As you can see, there are no longer any outliers in the numerical features of the dataset.
outlier replace with upper bound - RestingBP
outlier replace with upper bound - Cholesterol
outlier replace with upper bound - FastingBS
outlier replace with upper bound - MaxHR
outlier replace with upper bound - Oldpeak
outlier replace with upper bound - VCF
outlier replace with upper bound - Creatine

********************************

outlier replace with lower bound - RestingBP
outlier replace with lower bound - Cholesterol
outlier replace with lower bound - FastingBS
outlier replace with lower bound - MaxHR
outlier replace with lower bound - Oldpeak
outlier replace with lower bound - VCF
outlier replace with lower bound - Creatine
In [21]:
outlier = []
for i in range(len(continuous_values)):
    if outlier_detect(heart_data[continuous_values],continuous_values[i]).shape[0] !=0:
        outlier.append(continuous_values[i])

print("Numerical Variables With Outlier Values : ")
outlier
Numerical Variables With Outlier Values : 
Out[21]:
[]

Exploratory Data Analysis (EDA) Summary 📊

1. Memory Usage 💾

  • Total memory usage: 175.1 KB

2. Dataset Overview 🗂️

  • Dataset Size: The dataset contains 1400 records and 16 features.
  • Feature Diversity: The features exhibit a range of unique values, indicating diversity in the dataset.

3. Data Quality Check 🛠️

  • Duplicates Check: No duplicate rows were found, so the dataset retains its original size of (1400, 16).
  • Missing Values: No missing values were detected in any of the features, suggesting good data completeness.

4. Outlier Detection 🚨

  • Outlier-Prone Features: Numerical variables such as RestingBP, Cholesterol, MaxHR, Oldpeak, VCF, and Creatine exhibit outlier values.
  • Outlier Detection Methods: Outliers were identified using both Interquartile Range (IQR) and Z-Score methods.
  • Detection Results: While some features showed significant outliers (e.g., RestingBP, Cholesterol), others had minimal or no outliers (e.g., Smoking, Thal).

5. Summary 🌟

  • Sex distribution shows a higher number of males (856) compared to females (544).
  • The most common type of chest pain is Asymptomatic (ASY), with 760 occurrences.
  • The majority of patients have a Normal resting ECG (780).
  • Exercise-induced angina is less common, with "No" responses being more frequent (815).
  • The ST slope is most commonly Flat (709).
  • There is an equal distribution of heart disease presence (Yes: 700, No: 700).

Conclusion 📝

  • Categorical Columns: The insights derived from the categorical columns help in understanding the prevalence of different conditions and their relationship with heart disease.
  • Data Integrity: The dataset appears to be clean and of good quality, with no missing values and duplicates addressed.
  • Outlier Awareness: Identification of outliers is crucial for maintaining data integrity and ensuring accurate analysis and modeling results.
  • Further Analysis: Exploring relationships between features and the target variable (HeartDisease) can provide valuable insights for predictive modeling tasks.

By conducting thorough EDA, we establish a solid foundation for subsequent data preprocessing and modeling stages, enhancing the effectiveness and reliability of our analytical processes.

5 | Visualization


5.1 | Visualizing Numerical Features

In [28]:
# Create subplots with 5 rows and 2 columns
fig, ax = plt.subplots(nrows=5, ncols=2, figsize=(19, 21))

# Define colors for plots
colors = ['#4D3425', '#E4512B', '#5A9BD4', '#FFD700', '#4CAF50', '#F08080', '#808000', '#87CEEB', '#9370DB', '#20B2AA', '#8B4513']  

# Columns to visualize
columns_to_visualize = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak', 'VCF', 'Smoking', 'Creatine', 'Thal']

# Plot histograms for each column
for i in range(9):
    plt.subplot(5, 2, i + 1)  # Adjust subplot index
    current_color = colors[i % len(colors)]  
    sns.histplot(heart_data[columns_to_visualize[i]], kde=True, bins=20, color=current_color, edgecolor='black')
    plt.title(f'Distribution of {columns_to_visualize[i]}')
    plt.xlabel(columns_to_visualize[i])

# Remove the last subplot
plt.delaxes(ax[4, 1])  
plt.tight_layout()
plt.show()

5.2 | Visualizing All Categorical Features

In [22]:
plt.rcParams['axes.facecolor'] = '#f6f5f5'

# Define color palette
color_palette = ["#800000", "#8000ff", "#6aac90", "#5833ff", "#da8829"]

# Create figure and axes
fig, axs = plt.subplots(3, 2, figsize=(17, 17))

# Define categorical columns
categorical_columns = ['Sex', 'ChestPainType', 'RestingECG', 'Exercise Agina', 'ST_Slope']

# Plot count plots for each categorical column
for column, ax in zip(categorical_columns, axs.flatten()):
    sns.countplot(x=column, data=heart_data, palette=color_palette, ax=ax)
    ax.set_xlabel('')
    ax.set_ylabel('Count')
    ax.set_title(column)

# Remove unused subplots
for i in range(len(categorical_columns), axs.size):
    axs.flatten()[i].axis('off')

# Adjust layout
plt.tight_layout()

# Show plot
plt.show()

5.3 | Visualizing the TARGET Variable's Relation to Features

5.3.1 | Visualizing the TARGET Feature Against Numerical Features

In [6]:
import plotly.express as px

# Define a template
temp = go.layout.Template(
    layout=go.Layout(
        title_font=dict(family="Arial", size=21, color="black"),
        legend=dict(font=dict(family="Arial", size=12)),
        # Add more layout settings as needed
    )
)

# Scatter plot for patients with and without heart disease
fig_combined = px.scatter_matrix(heart_data,
                                 dimensions=["Age", "Cholesterol", "RestingBP", "MaxHR", "Oldpeak"],
                                 title='Features Comparison for Patients with and without Heart Disease',
                                 color='HeartDisease', symbol='HeartDisease',
                                 color_discrete_sequence=["#FFDAB9", "#8B0000"],
                                 symbol_sequence=["circle", "circle"],
                                 template=temp)

# Update marker attributes
fig_combined.update_traces(marker=dict(size=15, opacity=.7, line_width=1), 
                           diagonal_visible=False, showupperhalf=False)

# Update layout to increase the size of the plots and add custom legend
fig_combined.update_layout(height=1000, width=1000,
                           legend=dict(
                               title="Heart Disease",
                               orientation="h",
                               yanchor="bottom",
                               y=1.02,
                               xanchor="right",
                               x=1
                           ))

# Show the combined plot
fig_combined.show()

5.3.2 | Visualizing Categorical Variables Against TARGET

In [7]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Group data by categories
categories = ['ChestPainType', 'Sex', 'RestingECG', 'Exercise Agina', 'ST_Slope']
figs = []

# Create a figure for each category
for category in categories:
    grouped_data = heart_data.groupby(['HeartDisease', category]).size().unstack(fill_value=0)
    colors = ['#BE6B6B', '#FF9999', '#C1D2D1', '#598885', '#E5BAB4']  # Define colors
    fig = go.Figure()

    # Add traces for each category type
    for i, cat_type in enumerate(grouped_data.columns):
        fig.add_trace(go.Bar(x=grouped_data.index, y=grouped_data[cat_type], name=cat_type, marker_color=colors[i]))

    # Update layout
    fig.update_layout(showlegend=True, barmode='group', bargap=0.15, legend_title_text=category, height=400, width=800, plot_bgcolor='rgba(255, 255, 255, 0.7)')
    fig.update_xaxes(title_text="Heart Disease")
    fig.update_yaxes(title_text="Frequency")
    figs.append(fig)

# Show plots
for fig in figs:
    fig.show()

5.4 | Visualization of the Target Variable ["HeartDisease"]

In [6]:
# Define colors and calculate percentages for the pie chart
colors = ["#8B0000", "#FFDAB9", "#8B008B", "#FF8C00"]
counts = heart_data['HeartDisease'].value_counts()
percentages = [counts['Y'] / counts.sum() * 100, counts['N'] / counts.sum() * 100]

# Create figure and axes
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))  # Adjusted figsize for better readability

# Plot the pie chart
ax[0].pie(percentages, labels=['Heart Disease', 'No Heart Disease'], autopct='%1.1f%%', startangle=90,
          explode=(0.1, 0), colors=colors[:2], wedgeprops={'edgecolor': 'black', 'linewidth': 1, 'antialiased': True})
ax[0].set_title('Heart Disease Distribution (%)')

# Plot the count plot
sns.countplot(x='HeartDisease', data=heart_data, palette=colors[:2], ax=ax[1])
ax[1].set_title('Cases of Heart Disease')
ax[1].set_xlabel('Heart Disease')
ax[1].set_ylabel('Count')
ax[1].set_xticklabels(['No Heart Disease', 'Heart Disease'])

# Adjust layout
fig.tight_layout(pad=3)
plt.show()

5.5 | Correlation Relationships

It is necessary to remove highly correlated variables to improve your model. Correlations can be computed with pandas' .corr() function, and the correlation matrix can be visualized with seaborn or plotly express.

  • Lighter shades represent positive correlation
  • Darker shades represent negative correlation
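As a minimal illustration of the pandas side of this, .corr() on numeric columns returns a symmetric matrix with ones on the diagonal (the toy values below are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Age":     [40, 49, 37, 48, 54],
    "MaxHR":   [172, 156, 98, 108, 122],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0],
})
corr = toy.corr()  # pairwise Pearson correlations
# e.g. px.imshow(corr, text_auto=True) or sns.heatmap(corr, annot=True) to visualize
```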
In [15]:
# Compute descriptive statistics for individuals with and without heart disease
heart_disease = heart_data[heart_data['HeartDisease'] == 'Y'].describe().T
no_heart_disease = heart_data[heart_data['HeartDisease'] == 'N'].describe().T

# Create a figure with two subplots side-by-side
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 8))  # Increased figsize for better readability

# Plot heatmap for individuals with heart disease
sns.heatmap(heart_disease[['mean']], annot=True, cmap='YlOrRd', linewidths=0.4, linecolor='black', cbar=False, fmt='.2f', ax=ax[0])
ax[0].set_title('Heart Disease')

# Plot heatmap for individuals without heart disease
sns.heatmap(no_heart_disease[['mean']], annot=True, cmap='YlGnBu', linewidths=0.4, linecolor='black', cbar=False, fmt='.2f', ax=ax[1])
ax[1].set_title('No Heart Disease')

# Adjust layout for better spacing
fig.tight_layout(pad=3)
plt.show()

5.5.1 | Correlation of Continuous Features

In [9]:
df_corr= heart_data[continuous_values].corr()
df_corr
Out[9]:
Age RestingBP Cholesterol FastingBS MaxHR Oldpeak VCF Smoking Creatine Thal
Age 1.000000 0.187222 -0.071334 0.021653 -0.251494 0.142371 0.017135 0.004155 0.112963 -0.038278
RestingBP 0.187222 1.000000 0.138005 0.053381 -0.074368 0.096835 -0.004460 0.016034 0.032899 0.019768
Cholesterol -0.071334 0.138005 1.000000 -0.010789 0.155643 0.060347 -0.036740 -0.065205 0.089271 0.315564
FastingBS 0.021653 0.053381 -0.010789 1.000000 -0.027226 -0.032672 0.107413 0.013005 0.011581 0.011035
MaxHR -0.251494 -0.074368 0.155643 -0.027226 1.000000 -0.085468 0.000727 -0.059284 -0.005993 0.043660
Oldpeak 0.142371 0.096835 0.060347 -0.032672 -0.085468 1.000000 0.011560 0.024903 0.044719 0.068832
VCF 0.017135 -0.004460 -0.036740 0.107413 0.000727 0.011560 1.000000 0.028354 0.058569 0.022094
Smoking 0.004155 0.016034 -0.065205 0.013005 -0.059284 0.024903 0.028354 1.000000 -0.020810 -0.012151
Creatine 0.112963 0.032899 0.089271 0.011581 -0.005993 0.044719 0.058569 -0.020810 1.000000 0.125868
Thal -0.038278 0.019768 0.315564 0.011035 0.043660 0.068832 0.022094 -0.012151 0.125868 1.000000
In [12]:
plt.figure(figsize=(19,7))
sns.heatmap(df_corr, annot = True, cmap = 'YlGnBu')
plt.title('Correlation Matrix of Continuous Variables')
plt.show()

🔍 Correlation Insights

  1. Heart Disease Correlations:

    • 💓 Exercise Agina (0.81) : Strongly positively correlated. Individuals experiencing angina during exercise are more likely to have heart disease.
    • 💔 Chest Pain Type (-0.17) : Weakly negatively correlated. Certain types of chest pain may be associated with a lower risk of heart disease.
  2. Other Variable Correlations with Heart Disease:

    • ❤️ MaxHR (-0.22): Moderately negatively correlated. Individuals with lower maximum heart rates during exercise are more likely to have heart disease.
    • 📉 ST_Slope (-0.22): Moderately negatively correlated. Certain patterns in the ST segment during exercise may indicate a lower risk of heart disease.
    • 📈 Resting ECG (0.05): Weakly positively correlated. Abnormalities in resting electrocardiographic results may slightly increase the likelihood of heart disease.
  3. Other Variable Correlations:

    • 👴🏻🔝 Age and MaxHR (-0.25): Moderately negatively correlated. Older individuals tend to reach lower maximum heart rates during exercise.
    • 🩺 Cholesterol and Thal (0.32): Moderately positively correlated. Certain types of thalassemia may influence cholesterol levels.

6 | Feature Engineering⚙️


The concepts covered in this section are:

Feature Engineering Part:

  1. Handling Categorical Variables
  2. Feature Scaling
  3. Feature Extraction

6.1 | Handling Categorical Variables

In [36]:
categorical_values
Out[36]:
['Sex',
 'ChestPainType',
 'RestingECG',
 'Exercise Agina',
 'ST_Slope',
 'HeartDisease']

6.1.1 | Categorical Variables Distribution Before Label Encoding

In [32]:
plt.figure(figsize=(20, 15))

num_cols = len(heart_data.columns)
num_rows = (num_cols // 4) + (num_cols % 4 > 0)  

for i, col in enumerate(heart_data.columns, 1):
    plt.subplot(num_rows, 4, i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(heart_data[col], kde=True, color = 'green', alpha = 0.5)
    plt.tight_layout()

plt.show()

6.1.2 | Label Encoding

In [13]:
# Create a LabelEncoder object
le = LabelEncoder()
df1 = heart_data.copy(deep=True)

# Apply Label Encoding using a loop
for col in categorical_values:
    df1[col] = le.fit_transform(df1[col]).astype('int64')

df1.head()
Out[13]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR Exercise Agina Oldpeak ST_Slope VCF Smoking Creatine Thal HeartDisease
0 40 1 1 140 289 0 1 172 0 0.0 2 2 0 168 3 1
1 49 0 2 160 180 1 1 156 0 1.0 1 0 0 155 3 1
2 37 1 1 130 283 0 2 98 0 0.0 2 0 1 125 3 1
3 48 0 0 138 214 0 1 108 1 1.5 1 1 0 161 3 1
4 54 1 2 150 195 1 1 122 0 0.0 2 3 0 106 2 1
In [32]:
df = df1[categorical_values].corr()
df
Out[32]:
Sex ChestPainType RestingECG Exercise Agina ST_Slope HeartDisease
Sex 1.000000 0.005235 0.042704 0.078186 -0.014215 0.073271
ChestPainType 0.005235 1.000000 -0.022056 -0.220928 0.133794 -0.165447
RestingECG 0.042704 -0.022056 1.000000 0.087585 -0.035039 0.054045
Exercise Agina 0.078186 -0.220928 0.087585 1.000000 -0.267247 0.812468
ST_Slope -0.014215 0.133794 -0.035039 -0.267247 1.000000 -0.218887
HeartDisease 0.073271 -0.165447 0.054045 0.812468 -0.218887 1.000000
In [27]:
# Calculate correlations excluding the 'HeartDisease' column
corr = df1.drop('HeartDisease', axis=1).corrwith(df1['HeartDisease']).sort_values(ascending=False).to_frame()
corr.columns = ['Correlations']

# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr, annot=True, annot_kws={"size": 11.5}, fmt='.2f', cmap='RdBu_r', center=0, linewidths=0.1, alpha=0.9)
plt.title('Correlation with Heart Disease (excluding HeartDisease)')
plt.xticks(rotation=0)
plt.yticks(rotation=0)
plt.show()

6.1.3 | Info of Dataset After Encoding¶

In [35]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1400 entries, 0 to 1399
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             1400 non-null   int64  
 1   Sex             1400 non-null   int64  
 2   ChestPainType   1400 non-null   int64  
 3   RestingBP       1400 non-null   int64  
 4   Cholesterol     1400 non-null   int64  
 5   FastingBS       1400 non-null   int64  
 6   RestingECG      1400 non-null   int64  
 7   MaxHR           1400 non-null   int64  
 8   Exercise Agina  1400 non-null   int64  
 9   Oldpeak         1400 non-null   float64
 10  ST_Slope        1400 non-null   int64  
 11  VCF             1400 non-null   int64  
 12  Smoking         1400 non-null   int64  
 13  Creatine        1400 non-null   int64  
 14  Thal            1400 non-null   int64  
 15  HeartDisease    1400 non-null   int64  
dtypes: float64(1), int64(15)
memory usage: 185.9 KB

As you can see, all the object-type columns have been encoded into integer type.

6.1.4 | Categorical Variables Distribution After Label Encoding¶

In [10]:
plt.figure(figsize=(17, 15))

num_cols = len(df1.columns)
num_rows = (num_cols // 4) + (num_cols % 4 > 0)  

for i, col in enumerate(df1.columns, 1):
    plt.subplot(num_rows, 4, i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(df1[col], kde=True, color = 'Darkred', alpha = 0.5)
    plt.tight_layout()

plt.show()

6.1.5 | Storing Categorical Columns Value Into A Separate Excel File After Encoding¶

In [108]:
encoded_categorical_columns = df1[['Sex', 'ChestPainType', 'RestingECG', 'Exercise Agina', 'ST_Slope', 'HeartDisease']]

excel_path = r"C:\Users\acer\Downloads\IDS Project\Categorical_Encoding.xlsx"

encoded_categorical_columns.to_excel(excel_path, index=False)
print("Excel File Created successfully!")
Excel File Created successfully!

6.2 | Feature Scaling

Why Feature Scaling Is Needed¶

  1. Feature scaling is necessary in machine learning to ensure that all features contribute equally to the model's performance and to prevent any single feature from dominating the learning algorithm.
  2. This is particularly important for algorithms that rely on distance-based calculations, such as KNN.
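The rescaling used below (Min-Max normalization) maps each feature to the [0, 1] range via x' = (x − min) / (max − min). A minimal sketch with toy values (not the project data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature column with a wide value range
raw = np.array([[180.0], [289.0], [400.0]])

mms = MinMaxScaler()
scaled = mms.fit_transform(raw)  # applies (x - min) / (max - min) per column

# The smallest value maps to 0.0 and the largest to 1.0
print(scaled.ravel())
```

After scaling, every feature spans the same range, so no single feature dominates distance computations.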


6.2.1 | Normalizing The Features¶

In [14]:
# Create scaler objects
mms = MinMaxScaler()  # For Min-Max scaling (Normalization)
df2 = df1.copy(deep=True)

# Apply scaling using a loop
for col in continuous_values:
    df2[col] = mms.fit_transform(df2[[col]]) 

df2.head()
Out[14]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR Exercise Agina Oldpeak ST_Slope VCF Smoking Creatine Thal HeartDisease
0 0.244898 1 1 0.70 0.479270 0.0 1 0.788732 0 0.295455 2 0.50 0.0 0.052328 1.000000 1
1 0.428571 0 2 0.80 0.298507 1.0 1 0.676056 0 0.409091 1 0.00 0.0 0.047636 1.000000 1
2 0.183673 1 1 0.65 0.469320 0.0 2 0.267606 0 0.295455 2 0.00 1.0 0.036810 1.000000 1
3 0.408163 0 0 0.69 0.354892 0.0 1 0.338028 1 0.465909 1 0.25 0.0 0.049802 1.000000 1
4 0.530612 1 2 0.75 0.323383 1.0 1 0.436620 0 0.295455 2 0.75 0.0 0.029953 0.666667 1

6.2.2 | Saving Normalized Features Into A CSV File¶

In [10]:
# Define the path to save the CSV file
csv_path = r"C:\Users\acer\Downloads\IDS Project\Normalization.csv"

# Save the DataFrame with scaled columns to a CSV file
df2.to_csv(csv_path, index=False)

print("CSV File Created successfully!")
CSV File Created successfully!

6.3 | Feature Extraction

6.3.1 | Finding Best Categorical Features¶

In [62]:
# Use all columns in categorical_values, excluding the target column
features = df2[categorical_values].drop(columns=['HeartDisease'])

X = features.iloc[:, :]   
y = heart_data['HeartDisease']     

# Applying SelectKBest with chi-squared test
best_features = SelectKBest(score_func=chi2, k='all')
fit = best_features.fit(X, y)

# Creating a DataFrame to store the chi-squared scores
featureScores = pd.DataFrame(data={'Feature': X.columns, 'Chi Squared Score': fit.scores_})

# Sort the features by their chi-squared scores in descending order
featureScores = featureScores.sort_values(by='Chi Squared Score', ascending=False)

# Print selected features and their scores
print("Selected Features and Chi Squared Scores:")
print(featureScores)

# Plotting the chi-squared scores
plt.subplots(figsize=(5, 5))
sns.heatmap(featureScores.set_index('Feature'), annot=True, linewidths=0.4, linecolor='black', fmt='.2f')
plt.title('Selection of Categorical Features (Excluding HeartDisease)')
plt.show()
Selected Features and Chi Squared Scores:
          Feature  Chi Squared Score
3  Exercise Agina         537.984615
1   ChestPainType          44.181818
4        ST_Slope          19.091929
0             Sex           2.920561
2      RestingECG           1.937984

6.3.2 | Finding Best Numerical Features¶

In [67]:
# Separating features and target variable
X_continuous = df2[continuous_values]
y_continuous = df2['HeartDisease']

# Applying ANOVA
f_values, p_values = f_classif(X_continuous, y_continuous)

# Creating a DataFrame to store the results
anova_results = pd.DataFrame(data={'F-value': f_values, 'p-value': p_values}, index=X_continuous.columns)

# Displaying the results
print("ANOVA Results:")
print(anova_results)

# Plotting the results
plt.figure(figsize=(8, 5))
sns.barplot(x=anova_results['F-value'], y=anova_results.index, palette='coolwarm')
plt.title('ANOVA F-values')
plt.xlabel('F-value')
plt.ylabel('Features')
plt.show()
ANOVA Results:
                F-value       p-value
Age           24.123926  1.009666e-06
RestingBP      4.136714  4.215104e-02
Cholesterol    0.140297  7.080425e-01
FastingBS      1.968707  1.608072e-01
MaxHR         30.831145  3.364223e-08
Oldpeak      109.160511  1.178386e-24
VCF            0.188899  6.639020e-01
Smoking        0.026966  8.695862e-01
Creatine       6.124673  1.344867e-02
Thal           0.017430  8.949856e-01

📊 Chi Squared Scores¶

Insight:

  • 🏃‍♂️ Exercise Angina shows the strongest association with the target variable.
  • 💔 Chest Pain Type follows with moderate association.
  • ⛰️ ST Slope has a moderate association.
  • 👫 Sex and 📈 Resting ECG exhibit weaker associations.

-------------------------------------------------------------------------------------------------------------------------------¶

📈 ANOVA Results¶

Insight

  • ⏫ Oldpeak has the highest F-value and extremely low p-value, indicating its strong influence on the target.
  • 💓 Max HR and 🎂 Age also demonstrate significant impacts on the target.
  • 📉 Creatine shows moderate significance.
  • 💉 Resting BP and 🍔 Cholesterol have relatively lower F-values and higher p-values, suggesting weaker influences.

-------------------------------------------------------------------------------------------------------------------------------¶

Summarized Insights:¶

  1. 🏃‍♂️ Exercise Angina
  2. 💔 Chest Pain Type
  3. ⛰️ ST Slope
  4. 👴 Oldpeak
  5. 💓 Max HR
  6. 🎂 Age

are key predictors of the target variable.

  • 👫 Sex, 📈 Resting ECG, 🍔 Cholesterol, 🍽️ Fasting BS, 🌊 VCF, 🚭 Smoking, and 🧪 Thal show weaker associations.

Consider prioritizing features with higher Chi Squared scores and lower p-values in ANOVA for predictive modeling.¶


7 | Data Separation and Splitting 🪓

¶

7.1 | Data Separation

In [15]:
selected_features = ['Age', 'ChestPainType', 'MaxHR' ,'Exercise Agina', 'Oldpeak',  'ST_Slope']
In [18]:
# Extract the selected features from the DataFrame
features = df2[selected_features].values
target = df2['HeartDisease'].values

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=42)
In [19]:
# Extract the selected features from the DataFrame
X = df2.drop(columns='HeartDisease', axis=1).values
Y = df2['HeartDisease'].values

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)

# Initialize and fit SelectKBest
Kbest_classif = SelectKBest(score_func=f_classif, k=6)
Kbest_classif.fit(x_train, y_train)

# Print the scores for the features
for i in range(len(Kbest_classif.scores_)):
    print(f'Feature {i} : {round(Kbest_classif.scores_[i], 3)}')


# Plot the feature scores
plt.bar(df2.drop(columns='HeartDisease').columns, Kbest_classif.scores_)
plt.xticks(rotation=90)
plt.rcParams["figure.figsize"] = (8, 6)
plt.show()
Feature 0 : 18.626
Feature 1 : 6.584
Feature 2 : 29.905
Feature 3 : 0.959
Feature 4 : 0.192
Feature 5 : 3.668
Feature 6 : 6.748
Feature 7 : 22.826
Feature 8 : 1977.81
Feature 9 : 80.13
Feature 10 : 41.493
Feature 11 : 0.682
Feature 12 : 0.119
Feature 13 : 4.775
Feature 14 : 0.155
In [20]:
# transform training set
x_train_classif = Kbest_classif.transform(x_train)
print("X_train.shape: {}".format(x_train.shape))
print()
print("X_train_selected.shape: {}".format(x_train_classif.shape))
print()
# transform test data
x_test_classif = Kbest_classif.transform(x_test)
X_train.shape: (1120, 15)

X_train_selected.shape: (1120, 6)

In [21]:
# Get the selected feature indices
selected_feature_indices = Kbest_classif.get_support(indices=True)

# Get the column names of the selected features
selected_feature_names = df2.drop(columns='HeartDisease').columns[selected_feature_indices]

# Display the column names
print("Selected feature names:")
print(selected_feature_names)
Selected feature names:
Index(['Age', 'ChestPainType', 'MaxHR', 'Exercise Agina', 'Oldpeak',
       'ST_Slope'],
      dtype='object')

7.1.1 | Shape of Train & Test Data

In [22]:
print("Training set features shape:", x_train_classif.shape)
print("Testing set features shape:", x_test_classif.shape)
print("Training set target shape:", y_train.shape)
print("Testing set target shape:", y_test.shape)
Training set features shape: (1120, 6)
Testing set features shape: (280, 6)
Training set target shape: (1120,)
Testing set target shape: (280,)

7.1.2 | Saving Training and Testing Data Into CSV File With Feature Names¶

In [23]:
# Extract the selected features from the DataFrame
features = df2[selected_features]
target = df2['HeartDisease']

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=42)
# Save the training features
x_train_path = r'C:\Users\acer\Downloads\IDS Project\x_train.csv'
if os.path.exists(x_train_path):
    os.remove(x_train_path)
x_train_df = pd.DataFrame(data=x_train_classif)
x_train_df.to_csv(x_train_path, index=False)
print("Training features saved successfully!")

# Save the testing features
x_test_path = r'C:\Users\acer\Downloads\IDS Project\x_test.csv'
if os.path.exists(x_test_path):
    os.remove(x_test_path)
x_test_df = pd.DataFrame(data=x_test_classif)
x_test_df.to_csv(x_test_path, index=False)
print("Testing features saved successfully!")

# Save the training target
y_train_path = r'C:\Users\acer\Downloads\IDS Project\y_train.csv'
if os.path.exists(y_train_path):
    os.remove(y_train_path)
y_train_df = pd.DataFrame(data=y_train, columns=['HeartDisease'])
y_train_df.to_csv(y_train_path, index=False)
print("Training target saved successfully!")

# Save the testing target
y_test_path = r'C:\Users\acer\Downloads\IDS Project\y_test.csv'
if os.path.exists(y_test_path):
    os.remove(y_test_path)
y_test_df = pd.DataFrame(data=y_test, columns=['HeartDisease'])
y_test_df.to_csv(y_test_path, index=False)
print("Testing target saved successfully!")
Training features saved successfully!
Testing features saved successfully!
Training target saved successfully!
Testing target saved successfully!

8 | Model Training & Implementation 🛠️


8.1 | Functions For Model Training, Evaluation And Visualization

In [19]:
def model_evaluation(classifier, x_test, y_test):
    # Confusion Matrix
    cm = confusion_matrix(y_test, classifier.predict(x_test))
    names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']

    # Format confusion matrix values
    labels = [['{}\n{}'.format(name, value) for name, value in zip(names, row)] for row in cm]

    sns.heatmap(cm, annot=labels, fmt='', annot_kws={"size": 14})
    plt.title('Confusion Matrix')
    plt.show()

    # Classification Report
    print("\nClassification Report:\n", classification_report(y_test, classifier.predict(x_test)))


def model(classifier, x_train, y_train, x_test, y_test):
    classifier.fit(x_train, y_train)
    prediction = classifier.predict(x_test)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

    # Calculate metrics
    train_accuracy = accuracy_score(y_train, classifier.predict(x_train))
    test_accuracy = accuracy_score(y_test, prediction)
    precision = precision_score(y_test, prediction)
    recall = recall_score(y_test, prediction)
    f1 = f1_score(y_test, prediction)
    cross_val_score_mean = cross_val_score(classifier, x_train, y_train, cv=cv, scoring='roc_auc').mean()
    roc_auc = roc_auc_score(y_test, prediction)
    
    print("Training Accuracy: {:.2%}".format(train_accuracy))
    print("Testing Accuracy: {:.2%}".format(test_accuracy))
    print("Precision: {:.2%}".format(precision))
    print("Recall: {:.2%}".format(recall))
    print("F1 Score: {:.2%}".format(f1))
    print("Cross Validation Score: {:.2%}".format(cross_val_score_mean))
    print("ROC_AUC Score: {:.2%}".format(roc_auc))

    # Evaluation
    model_evaluation(classifier, x_test, y_test)
    
    
def kfold_cross_validation(classifier, x_train, y_train, cv, scoring=accuracy_score):
    # Use stratified k-fold for classification problems if the classifier supports probability prediction
    if hasattr(classifier, 'predict_proba'):
        kfold = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    else:
        kfold = KFold(n_splits=cv, shuffle=True, random_state=42)

    scores = []
    for train_index, test_index in kfold.split(x_train, y_train):
        x_train_fold, x_test_fold = x_train[train_index], x_train[test_index]
        y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]

        classifier.fit(x_train_fold, y_train_fold)
        y_pred = classifier.predict(x_test_fold)

        score = scoring(y_test_fold, y_pred)  # Use the provided scoring function
        scores.append(score)

    return scores


8.2 | Hyperparameter Tuning

I performed hyperparameter tuning for four models:

1. 🌳 Random Forest Classifier:¶

We explored different hyperparameters such as n_estimators, max_depth, and min_samples_split using techniques like Grid Search or Random Search to optimize the model's performance.

2. 🧠 MLP Classifier:¶

Similarly, we tuned hyperparameters such as hidden_layer_sizes, activation function, and learning rate to improve the performance of the MLP Classifier.

3. 📈 HistGradientBoosting Classifier:¶

We explored different hyperparameters such as max_iter, max_leaf_nodes, and max_depth to optimize the model's performance.

4. 👥 KNN Classifier:¶

We explored different hyperparameters such as n_neighbors, weights, and algorithm to optimize the model's performance.
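For reference, the Grid Search technique mentioned above (the notebook itself uses RandomizedSearchCV) can be sketched as follows; synthetic data stands in for `x_train_classif`/`y_train`, and the parameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

# GridSearchCV evaluates every combination (here 2 * 2 * 2 = 8 candidates, 3 folds each)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={
        'n_estimators': [50, 100],
        'max_depth': [8, 16],
        'min_samples_split': [2, 5],
    },
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
)
grid.fit(X_toy, y_toy)
print(grid.best_params_)
```

Grid Search is exhaustive, so it scales poorly as the grid grows; RandomizedSearchCV samples a fixed number of combinations instead, which is why it is used for the larger grids later in this notebook.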


8.3 | Implementation of Machine Learning Models

For this project, I have implemented 5 different machine learning algorithms:

  1. 🌳 Random Forest
  2. 🧠 MLP Classifier
  3. 📈 HistGradientBoosting Classifier
  4. 👥 KNeighbors Classifier
  5. 🌲 Decision Tree Classifier


8.3.1 | Random Forest Classifier

Random Forest Classifier Overview:¶

  • Type: Ensemble Learning (Bagging)
  • Task: Supervised Learning (Classification)
  • Strengths:
    • Robust to overfitting.
    • Handles high-dimensional data well.
    • Provides feature importance scores.
  • Considerations:
    • Requires tuning of hyperparameters.
    • Computationally intensive for large datasets.
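The feature importance scores mentioned above can be read directly off a fitted forest. A minimal sketch with synthetic data; the feature names are reused from this project purely for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the six selected features
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=42)
feature_names = ['Age', 'ChestPainType', 'MaxHR', 'Exercise Agina', 'Oldpeak', 'ST_Slope']

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_toy, y_toy)

# Impurity-based importances; they sum to 1.0 across all features
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances)
```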

8.3.1.1 | Hyperparameter Tuning of RANDOM FOREST CLASSIFIER

In [56]:
# Expanded parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [4, 8, 12, 16, 20], 
    'n_estimators': [50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],  
    'min_samples_split': [2, 5, 7, 11, 13, 15],
    'min_samples_leaf': [1, 2, 4, 6, 8, 10], 
    'max_features': ['sqrt', 'log2']  
}

# Create a RandomForestClassifier
classifier_rf = RandomForestClassifier(random_state= 42)

# Use RandomizedSearchCV for parameter tuning
random_search = RandomizedSearchCV(classifier_rf, param_distributions=param_grid, n_iter= 50, cv=5, scoring='accuracy', random_state= 42, n_jobs=-1)

8.3.1.1.1 | Fitting The Data For Parameter Tuning

In [57]:
random_search.fit(x_train_classif, y_train)
Out[57]:
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': [4, 8, 12, 16, 20],
                                        'max_features': ['sqrt', 'log2'],
                                        'min_samples_leaf': [1, 2, 4, 6, 8, 10],
                                        'min_samples_split': [2, 5, 7, 11, 13,
                                                              15],
                                        'n_estimators': [50, 100, 200, 300, 400,
                                                         500, 600, 700, 800,
                                                         900, 1000]},
                   random_state=42, scoring='accuracy')

8.3.1.1.2 | Best Parameters For Random Forest¶

In [58]:
# Print the best parameters
print("Best Parameters:", random_search.best_params_)
Best Parameters: {'n_estimators': 1000, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 20, 'criterion': 'entropy'}

8.3.1.2 | Creating A Model For RANDOM FOREST Using Best Parameters

In [62]:
# Get the best model
best_classifier = random_search.best_estimator_

8.3.1.4 | Model Fitting, Training & Evaluation

In [80]:
model(best_classifier, x_train_classif, y_train, x_test_classif, y_test)
Training Accuracy: 89.64%
Testing Accuracy: 92.86%
Precision: 98.53%
Recall: 88.16%
F1 Score: 93.06%
Cross Validation Score: 89.87%
ROC_AUC Score: 93.30%
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.98      0.93       128
           1       0.99      0.88      0.93       152

    accuracy                           0.93       280
   macro avg       0.93      0.93      0.93       280
weighted avg       0.93      0.93      0.93       280

8.3.1.5 | K-Fold Cross Validation of RF

In [65]:
k_fold_scores = kfold_cross_validation(best_classifier, x_train_classif, y_train, cv=5, scoring=accuracy_score)

print("Mean Accuracy: {:.2f} %".format(np.mean(k_fold_scores)*100))
print("Std. Dev: {:.2f} %".format(np.std(k_fold_scores)*100))
Mean Accuracy: 89.38 %
Std. Dev: 1.37 %

A lower standard deviation means the model's performance is consistent across folds.

8.3.1.6 | Making Predictions With RF

In [18]:
# Use the best model for predictions on the test set with selected features
y_pred_rf = best_classifier.predict(x_test_classif)

# Create a new DataFrame with selected features and predictions for RandomForestClassifier
result_df_rf = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_rf })

# Display the result dataframe
result_df_rf.head(10)
Out[18]:
Actual Predicted
0 1 1
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 1 1

8.3.1.6.1 | Saving Predictions To A CSV File¶

In [67]:
# Define the file path for saving the CSV file
file_path = r"C:\Users\acer\Downloads\IDS Project\RandomForestPredictions.csv"

# Save the result dataframe to a CSV file
result_df_rf.to_csv(file_path, index=False)

print(f"Results saved to {file_path}")
Results saved to C:\Users\acer\Downloads\IDS Project\RandomForestPredictions.csv

8.3.1.7 | Creating An Input System Using RF Model To Predict An Individual Outcome

In [20]:
def predict_heart_disease(model, input_data):
    # Convert input data to a list
    input_data_as_list = list(input_data)
    # Reshape the list as we are predicting for only one instance
    input_data_reshaped = [input_data_as_list]
    # Make prediction using the model
    prediction = model.predict(input_data_reshaped)
    # Return the prediction
    return prediction[0]

# Function to take input from the user
def get_user_input():
    age = int(input("Enter Age: "))
    chest_pain_type = int(input("Enter Chest Pain Type (0 for Typical Angina, 1 for Atypical Angina, 2 for Non-Anginal Pain, 3 for Asymptomatic): "))
    max_hr = int(input("Enter Max Heart Rate: "))
    exercise_angina = int(input("Enter Exercise-Induced Angina (0 for No, 1 for Yes): "))
    oldpeak = float(input("Enter Oldpeak: "))
    st_slope = int(input("Enter ST Slope (0 for Upsloping, 1 for Flat, 2 for Downsloping): "))

    # Order must match selected_features: Age, ChestPainType, MaxHR, Exercise Agina, Oldpeak, ST_Slope
    input_data_as_list = [age, chest_pain_type, max_hr, exercise_angina, oldpeak, st_slope]

    return input_data_as_list

input_data2 = get_user_input()
result2 = predict_heart_disease(best_classifier, input_data2)

# Print results
print("\nIndividual 2:", "Heart Disease" if result2 == 1 else "No Heart Disease")
Enter Age: 46
Enter Chest Pain Type (0 for Typical Angina, 1 for Atypical Angina, 2 for Non-Anginal Pain, 3 for Asymptomatic): 0
Enter Max Heart Rate: 112
Enter Exercise-Induced Angina (0 for No, 1 for Yes): 0
Enter Oldpeak: 0
Enter ST Slope (0 for Upsloping, 1 for Flat, 2 for Downsloping): 2

Individual 2: No Heart Disease


8.3.2 | MLP Classifier

MLP Classifier Overview:¶

  • Type: Neural Network
  • Task: Supervised Learning (Classification)
  • Strengths:
    • Learns complex patterns.
    • Flexible with various configurations.
  • Considerations:
    • Needs careful tuning and regularization.
    • Computationally intensive



8.3.2.1 | Hyperparameter Tuning of MLP

In [72]:
# Define the MLPClassifier
mlp = MLPClassifier(random_state=42, max_iter=1000)

# Define the parameter distributions to search
param_dist = {
    'hidden_layer_sizes': [(100,), (50, 50), (50, 100, 50)],
    'activation': ['relu', 'tanh', 'logistic'],
    'alpha': [0.0001, 0.001, 0.01, 0.1],
    'learning_rate': ['constant', 'adaptive'],
}

8.3.2.1.2 | Fitting The Data For Parameter Tuning¶

In [73]:
# Initialize RandomizedSearchCV
random_search_mlp = RandomizedSearchCV(estimator=mlp, param_distributions=param_dist, n_iter=50, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)

# Fit RandomizedSearchCV
random_search_mlp.fit(x_train_classif, y_train)
Out[73]:
RandomizedSearchCV(cv=5,
                   estimator=MLPClassifier(max_iter=1000, random_state=42),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'activation': ['relu', 'tanh',
                                                       'logistic'],
                                        'alpha': [0.0001, 0.001, 0.01, 0.1],
                                        'hidden_layer_sizes': [(100,), (50, 50),
                                                               (50, 100, 50)],
                                        'learning_rate': ['constant',
                                                          'adaptive']},
                   random_state=42, scoring='accuracy')

8.3.2.1.3 | Printing Best Parameters¶

In [74]:
# Print the best parameters
print(f'Best parameters for MLPClassifier: {random_search_mlp.best_params_}')

# Print the best score
print(f'Best cross-validation accuracy for MLPClassifier: {random_search_mlp.best_score_}')
Best parameters for MLPClassifier: {'learning_rate': 'constant', 'hidden_layer_sizes': (50, 50), 'alpha': 0.01, 'activation': 'logistic'}
Best cross-validation accuracy for MLPClassifier: 0.89375


8.3.2.3 | Creating A Model Using Best Parameters For MLP

In [75]:
# Use the best estimator to make predictions
best_mlp = random_search_mlp.best_estimator_


8.3.2.4 | Model Fitting, Training & Evaluation

In [29]:
# Re-create the MLP model using the best parameters found above
best_mlp = MLPClassifier(activation='logistic', alpha=0.01, learning_rate='constant', hidden_layer_sizes=(50, 50), max_iter=1000)
In [30]:
model(best_mlp, x_train_classif, y_train, x_test_classif, y_test)
Training Accuracy: 89.38%
Testing Accuracy: 92.86%
Precision: 98.53%
Recall: 88.16%
F1 Score: 93.06%
Cross Validation Score: 88.78%
ROC_AUC Score: 93.30%
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.98      0.93       128
           1       0.99      0.88      0.93       152

    accuracy                           0.93       280
   macro avg       0.93      0.93      0.93       280
weighted avg       0.93      0.93      0.93       280


8.3.2.5 | Making Predictions Using MLP Model

In [80]:
# Use the best model for predictions on the test set with selected features
y_pred_mlp = best_mlp.predict(x_test_classif)

# Create a new DataFrame with actual values and predictions for the MLPClassifier
result_df_mlp = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_mlp})

# Display the result dataframe
result_df_mlp.head(10)
Out[80]:
Actual Predicted
0 1 1
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 1 1

8.3.2.5.1 | Saving MLP Predictions To A CSV File¶

In [81]:
# Define the file path for saving the CSV file
file_path = r"C:\Users\acer\Downloads\IDS Project\MLP_Predictions.csv"

# Save the result dataframe to a CSV file
result_df_mlp.to_csv(file_path, index=False)

print("Results saved to Designated File Path")
Results saved to Designated File Path


8.3.3 | Decision Tree Classifier

Decision Tree Classifier Overview:¶

  • Type: Decision Tree
  • Task: Supervised Learning (Classification)
  • Strengths:
    • Simple to interpret and visualize.
    • Minimal data preprocessing required.
  • Considerations:
    • Prone to overfitting.
    • Less powerful compared to ensemble methods.
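The interpretability noted above can be shown with a small sketch: `sklearn.tree.export_text` prints the learned if/else rules (synthetic data, generic feature names), and `plot_tree` draws the same structure graphically:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-class data with four generic features
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=42)

dt = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_toy, y_toy)

# Human-readable decision rules learned by the tree
rules = export_text(dt, feature_names=['f0', 'f1', 'f2', 'f3'])
print(rules)
```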


In [82]:
classifier_dt = DecisionTreeClassifier(random_state = 42, max_depth = 20, min_samples_leaf = 4, min_samples_split = 2)
model(classifier_dt, x_train_classif, y_train, x_test_classif, y_test)
Training Accuracy: 92.50%
Testing Accuracy: 87.86%
Precision: 90.41%
Recall: 86.84%
F1 Score: 88.59%
Cross Validation Score: 87.95%
ROC_AUC Score: 87.95%
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87       128
           1       0.90      0.87      0.89       152

    accuracy                           0.88       280
   macro avg       0.88      0.88      0.88       280
weighted avg       0.88      0.88      0.88       280


8.3.3.1 | Decision Tree Classifier Predictions

In [83]:
# Use the best model for predictions on the test set with selected features
y_pred_dt = classifier_dt.predict(x_test_classif)

# Create a new DataFrame with actual values and predictions for the DecisionTreeClassifier
result_dt = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_dt })

# Display the result dataframe
result_dt.head()
Out[83]:
Actual Predicted
0 1 1
1 0 1
2 1 1
3 1 1
4 0 0


8.3.4 | HistGradientBoosting Classifier

HistGradientBoosting Classifier Overview:¶

  • Type: Ensemble Learning (Boosting)
  • Task: Supervised Learning (Classification)
  • Strengths:
    • High accuracy.
    • Efficient for large datasets.
  • Considerations:
    • Sensitive to hyperparameters.
    • Requires careful tuning to avoid overfitting.
In [50]:
# Define the parameter distribution for HistGradientBoostingClassifier
param_dist_hgb = {
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'max_iter': [100, 200, 400, 800, 1000],
    'max_leaf_nodes': [5, 10, 20, 30],
    'min_samples_leaf': [1, 5, 10, 15, 20],
}

# Initialize the HistGradientBoostingClassifier
hgb = HistGradientBoostingClassifier(random_state=42)

# Initialize RandomizedSearchCV
random_search_hgb = RandomizedSearchCV(estimator=hgb, param_distributions=param_dist_hgb, n_iter=50, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)

# Fit RandomizedSearchCV
random_search_hgb.fit(x_train_classif, y_train)
Out[50]:
RandomizedSearchCV(cv=5,
                   estimator=HistGradientBoostingClassifier(random_state=42),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'learning_rate': [0.01, 0.1, 0.2, 0.3],
                                        'max_iter': [100, 200, 400, 800, 1000],
                                        'max_leaf_nodes': [5, 10, 20, 30],
                                        'min_samples_leaf': [1, 5, 10, 15, 20]},
                   random_state=42, scoring='accuracy')
In [51]:
# Print the best parameters
print(f'Best parameters for HistGradientBoosting: {random_search_hgb.best_params_}')
best_hgb = random_search_hgb.best_estimator_

# Print the best score
print(f'Best cross-validation accuracy for HistGradientBoosting: {random_search_hgb.best_score_}')
Best parameters for HistGradientBoosting: {'min_samples_leaf': 20, 'max_leaf_nodes': 5, 'max_iter': 1000, 'learning_rate': 0.01}
Best cross-validation accuracy for HistGradientBoosting: 0.89375
In [52]:
hist = HistGradientBoostingClassifier(learning_rate = 0.01, max_iter = 1000, max_leaf_nodes = 5, min_samples_leaf = 20, random_state = 42)
model(hist, x_train_classif, y_train, x_test_classif, y_test)
Training Accuracy: 89.82%
Testing Accuracy: 91.79%
Precision: 96.40%
Recall: 88.16%
F1 Score: 92.10%
Cross Validation Score: 89.75%
ROC_AUC Score: 92.13%
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.96      0.91       128
           1       0.96      0.88      0.92       152

    accuracy                           0.92       280
   macro avg       0.92      0.92      0.92       280
weighted avg       0.92      0.92      0.92       280

8.2.4.1 | HistGradientBoosting Classifier Predictions

In [79]:
# Use the best model for predictions on the test set with selected features
y_pred_h = hist.predict(x_test_classif)

# Create a new DataFrame with actual values and HistGradientBoosting predictions
result_h = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_h })

# Display the result dataframe
result_h.head()
Out[79]:
Actual Predicted
0 1 1
1 0 0
2 0 0
3 1 1
4 1 1


8.2.5 | K-Neighbours Classifier

KNeighbors Classifier Overview:¶

  • Type: Instance-based Learning
  • Task: Supervised Learning (Classification)
  • Strengths:
    • Simple and intuitive.
    • No training phase required.
  • Considerations:
    • Computationally expensive for large datasets.
    • Sensitive to irrelevant features.
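Because KNN ranks neighbours by raw distance, a feature on a much larger scale can drown out all the others. A hedged sketch on synthetic data (the notebook itself works with already-processed features):

```python
# Illustrative sketch: one feature with an inflated scale dominates
# Manhattan distances; standardizing first restores the contribution
# of the remaining features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X[:, 0] *= 1000  # blow up one feature's scale

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier(n_neighbors=11, metric='manhattan').fit(X_tr, y_tr)
scaled = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=11, metric='manhattan'),
).fit(X_tr, y_tr)

print(f"Unscaled: {raw.score(X_te, y_te):.2f}  Scaled: {scaled.score(X_te, y_te):.2f}")
```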
In [40]:
# Define the expanded parameter grid
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

# Initialize the KNeighborsClassifier
knn = KNeighborsClassifier()

# Initialize GridSearchCV
grid_search_knn = GridSearchCV(estimator=knn, param_grid=param_grid_knn, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV
grid_search_knn.fit(x_train_classif, y_train)
Out[40]:
GridSearchCV(cv=5, estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'metric': ['euclidean', 'manhattan', 'minkowski'],
                         'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
                         'weights': ['uniform', 'distance']},
             scoring='accuracy')
In [41]:
# Print the best parameters
print("Best parameters found:")
print(grid_search_knn.best_params_)
best_params_knn = grid_search_knn.best_params_

# Print the best cross-validation score
print("Best cross-validation accuracy:")
print(grid_search_knn.best_score_)
Best parameters found:
{'metric': 'manhattan', 'n_neighbors': 11, 'weights': 'uniform'}
Best cross-validation accuracy:
0.8928571428571429
In [42]:
# Initialize the KNeighborsClassifier with the best parameters
best_knn_model = KNeighborsClassifier(**best_params_knn)
In [12]:
best_knn_model = KNeighborsClassifier(metric = 'manhattan', n_neighbors = 11, weights = 'uniform')
In [14]:
model(best_knn_model, x_train_classif, y_train, x_test_classif, y_test)
Training Accuracy: 89.46%
Testing Accuracy: 92.14%
Precision: 97.10%
Recall: 88.16%
F1 Score: 92.41%
Cross Validation Score: 89.77%
ROC_AUC Score: 92.52%
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.97      0.92       128
           1       0.97      0.88      0.92       152

    accuracy                           0.92       280
   macro avg       0.92      0.93      0.92       280
weighted avg       0.93      0.92      0.92       280


8.2.5.1 | K-Neighbours Classifier Predictions

In [49]:
# Use the best model for predictions on the test set with selected features
y_pred_knn = best_knn_model.predict(x_test_classif)

# Create a new DataFrame with actual values and KNN predictions
result_knn = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_knn })

# Display the result dataframe
result_knn.head(9)
Out[49]:
Actual Predicted
0 1 1
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0

9 | ⚖️ Models Comparison 🏆

¶


9.1 | Finding CV Scores of Models For Comparison

In [93]:
# Define the list of models
models = [
    ('RF', best_classifier),
    ('MLP', best_mlp),
    ('KNN', best_knn_model),
    ('DT', classifier_dt),
    ('HC', hist)
]

# Initialize lists to store results and model names
results = []
names = []

# Perform cross-validation for each model and print the results
for name, model in models:
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    cv_results = cross_val_score(model, x_train_classif, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    mean_accuracy = cv_results.mean()
    std_deviation = cv_results.std()
    score = f"{name}: {mean_accuracy:.6f} ({std_deviation:.6f})"
    print(score)
RF: 0.893750 (0.032550)
MLP: 0.893750 (0.032550)
KNN: 0.891964 (0.032794)
DT: 0.850000 (0.041841)
HC: 0.892857 (0.033168)

📊 Mean Accuracy and Standard Deviation¶


1️⃣ Random Forest:

  • Mean Accuracy = 89.38%
  • Std Dev = 3.26%

2️⃣ MLP:

  • Mean Accuracy = 89.38%
  • Std Dev = 3.26%

3️⃣ KNN:

  • Mean Accuracy = 89.20%
  • Std Dev = 3.28%

4️⃣ Decision Tree:

  • Mean Accuracy = 85.00%
  • Std Dev = 4.18%

5️⃣ HistGradientBoosting:

  • Mean Accuracy = 89.29%
  • Std Dev = 3.32%
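The mean accuracies sit within roughly one standard deviation of each other, so the ranking is not clear-cut. A paired test on per-fold scores can check whether a gap between two models is meaningful; the fold scores below are illustrative stand-ins, not the notebook's actual `results` arrays:

```python
# Hedged sketch: paired t-test on per-fold CV accuracies of two models.
# With paired folds, even a small but consistent gap can be significant.
import numpy as np
from scipy.stats import ttest_rel

rf_folds = np.array([0.89, 0.93, 0.86, 0.91, 0.89, 0.88, 0.93, 0.89, 0.86, 0.90])
dt_folds = np.array([0.86, 0.88, 0.82, 0.85, 0.84, 0.83, 0.89, 0.85, 0.81, 0.87])

t_stat, p_value = ttest_rel(rf_folds, dt_folds)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p suggests a real gap
```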

9.2 | Visualization of Models For Comparison

In [96]:
# Create a DataFrame for visualization
data = []
for model_results, model_name in zip(results, names):
    for result in model_results:
        data.append((model_name, result))

df = pd.DataFrame(data, columns=['Model', 'Score'])

# Plot the results
fig, ax = plt.subplots(figsize=(12, 8))
fig.suptitle('Algorithm Comparison')

# Box plot
sns.boxplot(x='Model', y='Score', data=df, ax=ax, palette="Set2")

# Strip plot
sns.stripplot(x='Model', y='Score', data=df, ax=ax, color='black', size=5, jitter=True)

plt.xticks(rotation=45)
plt.show()


9.3 | Comparing Models Using Different Evaluation Metrics

In [121]:
def compare_models_metrics(classifiers, x_train, y_train, x_test, y_test, cv_scores):
    model_names = []
    metrics_summary = {
        "Train Accuracy": [],
        "Test Accuracy": [],
        "Precision": [],
        "Recall": [],
        "F1 Score": [],
        "Cross Val Score": []
    }
    
    for (name, classifier), cv_result in zip(classifiers, cv_scores):
        print("="*60)
        print(f"Model: {name} 🚀")
        classifier.fit(x_train, y_train)
        prediction = classifier.predict(x_test)

        # Calculate metrics
        train_accuracy = classifier.score(x_train, y_train)
        test_accuracy = accuracy_score(y_test, prediction)
        precision = precision_score(y_test, prediction)
        recall = recall_score(y_test, prediction)
        f1 = f1_score(y_test, prediction)
        cross_val_score_mean = cv_result.mean()

        metrics_summary["Train Accuracy"].append(train_accuracy)
        metrics_summary["Test Accuracy"].append(test_accuracy)
        metrics_summary["Precision"].append(precision)
        metrics_summary["Recall"].append(recall)
        metrics_summary["F1 Score"].append(f1)
        metrics_summary["Cross Val Score"].append(cross_val_score_mean)
        model_names.append(name)

        # Print metrics in a table
        table = [
            ["Train Accuracy", f"{train_accuracy:.2%}"],
            ["Test Accuracy", f"{test_accuracy:.2%}"],
            ["Precision", f"{precision:.2%}"],
            ["Recall", f"{recall:.2%}"],
            ["F1 Score", f"{f1:.2%}"],
            ["Cross Validation Score", f"{cross_val_score_mean:.2%}"]
        ]
        
        print(tabulate(table, headers=["Metric", "Value"], tablefmt="fancy_grid"))
        print()  # New line for better readability
    
    return metrics_summary, model_names  
In [147]:
from tabulate import tabulate
classifiers = [
    ('Random Forest', best_classifier),
    ('MLP', best_mlp),
    ('KNN', best_knn_model),
    ('Decision Tree', classifier_dt),
    ('HistClassifier', hist)
]
# Example usage:
metrics_summary, model_names = compare_models_metrics(classifiers, x_train_classif, y_train, x_test_classif, y_test, results)
============================================================
Model: Random Forest 🚀
╒════════════════════════╤═════════╕
│ Metric                 │ Value   │
╞════════════════════════╪═════════╡
│ Train Accuracy         │ 89.64%  │
├────────────────────────┼─────────┤
│ Test Accuracy          │ 92.86%  │
├────────────────────────┼─────────┤
│ Precision              │ 98.53%  │
├────────────────────────┼─────────┤
│ Recall                 │ 88.16%  │
├────────────────────────┼─────────┤
│ F1 Score               │ 93.06%  │
├────────────────────────┼─────────┤
│ Cross Validation Score │ 89.38%  │
╘════════════════════════╧═════════╛

============================================================
Model: MLP 🚀
╒════════════════════════╤═════════╕
│ Metric                 │ Value   │
╞════════════════════════╪═════════╡
│ Train Accuracy         │ 89.38%  │
├────────────────────────┼─────────┤
│ Test Accuracy          │ 92.86%  │
├────────────────────────┼─────────┤
│ Precision              │ 98.53%  │
├────────────────────────┼─────────┤
│ Recall                 │ 88.16%  │
├────────────────────────┼─────────┤
│ F1 Score               │ 93.06%  │
├────────────────────────┼─────────┤
│ Cross Validation Score │ 89.38%  │
╘════════════════════════╧═════════╛

============================================================
Model: KNN 🚀
╒════════════════════════╤═════════╕
│ Metric                 │ Value   │
╞════════════════════════╪═════════╡
│ Train Accuracy         │ 89.46%  │
├────────────────────────┼─────────┤
│ Test Accuracy          │ 92.14%  │
├────────────────────────┼─────────┤
│ Precision              │ 97.10%  │
├────────────────────────┼─────────┤
│ Recall                 │ 88.16%  │
├────────────────────────┼─────────┤
│ F1 Score               │ 92.41%  │
├────────────────────────┼─────────┤
│ Cross Validation Score │ 89.20%  │
╘════════════════════════╧═════════╛

============================================================
Model: Decision Tree 🚀
╒════════════════════════╤═════════╕
│ Metric                 │ Value   │
╞════════════════════════╪═════════╡
│ Train Accuracy         │ 92.05%  │
├────────────────────────┼─────────┤
│ Test Accuracy          │ 84.64%  │
├────────────────────────┼─────────┤
│ Precision              │ 84.28%  │
├────────────────────────┼─────────┤
│ Recall                 │ 88.16%  │
├────────────────────────┼─────────┤
│ F1 Score               │ 86.17%  │
├────────────────────────┼─────────┤
│ Cross Validation Score │ 85.00%  │
╘════════════════════════╧═════════╛

============================================================
Model: HistClassifier 🚀
╒════════════════════════╤═════════╕
│ Metric                 │ Value   │
╞════════════════════════╪═════════╡
│ Train Accuracy         │ 89.82%  │
├────────────────────────┼─────────┤
│ Test Accuracy          │ 91.79%  │
├────────────────────────┼─────────┤
│ Precision              │ 96.40%  │
├────────────────────────┼─────────┤
│ Recall                 │ 88.16%  │
├────────────────────────┼─────────┤
│ F1 Score               │ 92.10%  │
├────────────────────────┼─────────┤
│ Cross Validation Score │ 89.29%  │
╘════════════════════════╧═════════╛

💡 ----------------------Insights from Model Comparison---------------------------¶

1️⃣ Train Accuracy:¶

  • Decision Tree has the highest train accuracy (92.05%).
  • Random Forest and HistGradientBoosting have balanced high train accuracies.

2️⃣ Test Accuracy:¶

  • 🎯 Random Forest and MLP share the highest test accuracy (92.86%).

3️⃣ Precision:¶

  • ✨ Random Forest and MLP have the highest precision (98.53%).

4️⃣ Recall:¶

  • ⏪ All models, except the Decision Tree, have the same recall (88.16%).

5️⃣ F1 Score:¶

  • 💫Random Forest and MLP share the highest F1 Score (93.06%).

6️⃣ Cross Validation Score:¶

  • 🔄 Random Forest & MLP lead with a score of 89.38%.

---------------------------------------------------------------------------------------------------------¶

🏆 Best Model Based on Metrics¶


Overall Best Model: 🚀 Random Forest¶

  • High test accuracy, precision, recall, F1 score, and cross-validation score.

Best for Precision and F1 Score: 🌟 Random Forest and MLP¶

  • Both models excel in precision and F1 score.

Most Balanced Model: ⚖️ HistGradientBoostingClassifier¶

  • Balanced performance across all metrics.

---------------------------------------------------------------------------------------------------------¶

Conclusion:¶

The Random Forest stands out as the best overall due to its consistently high performance across multiple metrics, making it a robust choice for classification tasks.
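The pick above can also be made programmatically. A small sketch using the test accuracies from the comparison tables (values copied from the output above):

```python
# Select the best model name by test accuracy; max() with ties returns
# the first key in insertion order, so Random Forest wins the tie with MLP.
test_accuracy = {
    'Random Forest': 0.9286,
    'MLP': 0.9286,
    'KNN': 0.9214,
    'Decision Tree': 0.8464,
    'HistClassifier': 0.9179,
}

best_name = max(test_accuracy, key=test_accuracy.get)
print(f"Best model by test accuracy: {best_name} ({test_accuracy[best_name]:.2%})")
```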


9.4 | Creating Heatmap of Models Using Metrics For Comparison

In [154]:
def plot_metrics_heatmap(metrics_summary, model_names):
    # Create DataFrame with models as columns and metrics as rows
    df_metrics = pd.DataFrame(metrics_summary, index=model_names).T

    # Define a custom color palette
    colors = sns.color_palette("coolwarm", as_cmap=True)

    # Plot heatmap
    plt.figure(figsize=(12, 8))
    sns.set(font_scale=1.2)  # Increase font size for better readability
    sns.heatmap(df_metrics, annot=True, cmap=colors, fmt=".2f", linewidths=1, linecolor='gray', cbar=True)
    plt.title('Comparison of Classifier Metrics', fontsize=16)
    plt.xlabel('Model', fontsize=14)
    plt.ylabel('Metric', fontsize=14)
    plt.yticks(rotation=0)  # Keep y-axis labels horizontal
    plt.tight_layout()
    plt.show()

# Example usage:
plot_metrics_heatmap(metrics_summary, model_names)

10 | Saving Model 💾

¶

In [30]:
import joblib

filename = "RF_Model.joblib"
joblib.dump(best_classifier, filename)
Out[30]:
['RF_Model.joblib']
In [31]:
# Load the saved model
loaded_model = joblib.load("RF_Model.joblib")

# Assess the model's performance on the test set
result = loaded_model.score(x_test, y_test)
print("Model Accuracy:", result)
Model Accuracy: 0.9285714285714286
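Before deploying a persisted model, it is also worth checking that the reloaded artifact reproduces the original predictions exactly. A self-contained sketch with a synthetic stand-in (not the notebook's RF_Model.joblib):

```python
# Round-trip check: dump a fitted model with joblib, reload it, and
# verify the predictions are identical to the in-memory model's.
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

path = os.path.join(tempfile.gettempdir(), "rf_roundtrip.joblib")
joblib.dump(clf, path)
reloaded = joblib.load(path)

print("Predictions identical:", np.array_equal(clf.predict(X), reloaded.predict(X)))
```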
In [ ]:
def predict_heart_disease(model, input_data):
    print("Input data:", input_data)
    # Convert input data to a list
    input_data_as_list = list(input_data)
    print("Input data as list:", input_data_as_list)
    # Reshape the list as we are predicting for only one instance
    input_data_reshaped = np.array(input_data_as_list).reshape(1, -1)
    print("Input data reshaped:", input_data_reshaped)
    # Make prediction using the model
    prediction = model.predict(input_data_reshaped)
    print("Prediction:", prediction)
    # Return the prediction
    return prediction[0]


# Function to take input from the user
def get_user_input():
    age = int(input("Enter Age: "))
    chest_pain_type = int(input("Enter Chest Pain Type (0 for Typical Angina, 1 for Atypical Angina, 2 for Non-Anginal Pain, 3 for Asymptomatic): "))
    resting_ecg = int(input("Enter Resting ECG (0 for Normal, 1 for ST-T Wave Abnormality, 2 for Left Ventricular Hypertrophy): "))
    max_hr = int(input("Enter Max Heart Rate: "))
    exercise_angina = int(input("Enter Exercise-Induced Angina (0 for No, 1 for Yes): "))
    oldpeak = float(input("Enter Oldpeak: "))
    st_slope = int(input("Enter ST Slope (0 for Upsloping, 1 for Flat, 2 for Downsloping): "))

    # Convert all input data to a list
    input_data_as_list = [age, chest_pain_type, resting_ecg, max_hr, exercise_angina, oldpeak, st_slope]
    return input_data_as_list

input_data = get_user_input()
result = predict_heart_disease(loaded_model, input_data)

# Print results
print("\nIndividual Input Data:", input_data, "\nPrediction:", "Has Heart Disease" if result == 1 else "No Heart Disease")
In [31]:
# Saving Model
import pickle

filename = "MLP_model.sav"
pickle.dump(best_mlp, open(filename, 'wb'))

# Loading Model
loaded_model_mlp = pickle.load(open("MLP_model.sav", 'rb'))
result = loaded_model_mlp.score(x_test, y_test)
print(result)
In [32]:
# Loading Model
loaded_model_mlp = pickle.load(open("MLP_model.sav", 'rb'))
result = loaded_model_mlp.score(x_test, y_test)
print(result)
0.9285714285714286
In [34]:
def predict_heart_disease(model, input_data):
    print("Input data:", input_data)
    # Convert input data to a list
    input_data_as_list = list(input_data)
    print("Input data as list:", input_data_as_list)
    # Reshape the list as we are predicting for only one instance
    input_data_reshaped = np.array(input_data_as_list).reshape(1, -1)
    print("Input data reshaped:", input_data_reshaped)
    # Make prediction using the model
    prediction = model.predict(input_data_reshaped)
    print("Prediction:", prediction)
    # Return the prediction
    return prediction[0]


# Function to take input from the user
def get_user_input():
    age = int(input("Enter Age: "))
    chest_pain_type = int(input("Enter Chest Pain Type (0 for Typical Angina, 1 for Atypical Angina, 2 for Non-Anginal Pain, 3 for Asymptomatic): "))
    resting_ecg = int(input("Enter Resting ECG (0 for Normal, 1 for ST-T Wave Abnormality, 2 for Left Ventricular hypertrophy): "))
    max_hr = int(input("Enter Max Heart Rate: "))
    exercise_angina = int(input("Enter Exercise-Induced Angina (0 for No, 1 for Yes): "))
    oldpeak = float(input("Enter Oldpeak: "))
    st_slope = int(input("Enter ST Slope (0 for Upsloping, 1 for Flat, 2 for Downsloping): "))

    # Convert all input data to a list
    input_data_as_list = [age, chest_pain_type, resting_ecg, max_hr, exercise_angina, oldpeak, st_slope]
    return input_data_as_list

input_data = get_user_input()
result = predict_heart_disease(loaded_model_mlp, input_data)

# Print results
print("\nIndividual Input Data:", input_data, "\nPrediction:", "Has Heart Disease" if result == 1 else "No Heart Disease")
Enter Age: 46
Enter Chest Pain Type (0 for Typical Angina, 1 for Atypical Angina, 2 for Non-Anginal Pain, 3 for Asymptomatic): 1
Enter Resting ECG (0 for Normal, 1 for ST-T Wave Abnormality, 2 for Left Ventricular hypertrophy): 0
Enter Max Heart Rate: 112
Enter Exercise-Induced Angina (0 for No, 1 for Yes): 0
Enter Oldpeak: 0
Enter ST Slope (0 for Upsloping, 1 for Flat, 2 for Downsloping): 1
Input data: [46, 1, 0, 112, 0, 0.0, 1]
Input data as list: [46, 1, 0, 112, 0, 0.0, 1]
Input data reshaped: [[ 46.   1.   0. 112.   0.   0.   1.]]
Prediction: [0]

Individual Input Data: [46, 1, 0, 112, 0, 0.0, 1] 
Prediction: No Heart Disease